Web Scraping

Biostat 203B

Author

Dr. Hua Zhou @ UCLA

Published

February 2, 2026

Display machine information for reproducibility.

sessionInfo()
R version 4.5.2 (2025-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Tahoe 26.2

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.5.2    fastmap_1.2.0     cli_3.6.5        
 [5] tools_4.5.2       htmltools_0.5.9   otel_0.2.0        rstudioapi_0.17.1
 [9] yaml_2.3.12       rmarkdown_2.30    knitr_1.51        jsonlite_2.0.0   
[13] xfun_0.55         digest_0.6.39     rlang_1.1.7       evaluate_1.0.5   

Load tidyverse and other packages for this lecture.

library("chromote")
library("xml2")
library("rvest")
library("quantmod")
Loading required package: xts
Loading required package: zoo

Attaching package: 'zoo'
The following objects are masked from 'package:base':

    as.Date, as.Date.numeric
Loading required package: TTR
Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo 
library("tidyverse")
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter()         masks stats::filter()
✖ dplyr::first()          masks xts::first()
✖ readr::guess_encoding() masks rvest::guess_encoding()
✖ dplyr::lag()            masks stats::lag()
✖ dplyr::last()           masks xts::last()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

1 Web scraping

There is a wealth of data on the internet. How do we scrape and analyze it?

2 rvest

rvest is an R package, written by Hadley Wickham, that makes web scraping easy.
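
As a minimal sketch of the basic rvest workflow (parse HTML, select elements with CSS selectors, extract text or attributes), the example below works on a small inline HTML fragment rather than a live site. The fragment, its class names (`movie`, `year`), and the film titles are hypothetical and exist only for illustration.

```r
library(rvest)

# Hypothetical HTML fragment standing in for a downloaded page
page <- minimal_html('
  <ul>
    <li class="movie"><a href="/title/1">First Film</a> <span class="year">2023</span></li>
    <li class="movie"><a href="/title/2">Second Film</a> <span class="year">2024</span></li>
  </ul>
')

# Select the links inside each .movie item and pull out their text
titles <- page |>
  html_elements(".movie a") |>
  html_text2()

# The same elements also carry attributes, e.g. the link target
links <- page |>
  html_elements(".movie a") |>
  html_attr("href")

# Text is always returned as character; convert it when a number is needed
years <- page |>
  html_elements(".movie .year") |>
  html_text2() |>
  as.integer()
```

The same pattern (`html_elements()` followed by `html_text2()` or `html_attr()`) applies to a page obtained from a real URL; only the CSS selectors change.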

3 Example: Scraping from webpage

# Specify the URL of the website to be scraped
url <- "https://www.imdb.com/search/title/?title_type=feature&release_date=2024-01-01,2024-12-31&count=100"
# Read the HTML code from the website
(webpage <- read_html_live(url))
{xml_nodeset (144)}
 [1] <script async="" src="https://images-na.ssl-images-amazon.com/images/I/2 ...
 [2] <meta charset="utf-8">\n
 [3] <meta name="viewport" content="width=device-width">\n
 [4] <script>if(typeof uet === 'function'){ uet('bb', 'LoadTitle', {wb: 1});  ...
 [5] <title>Movie, Release date between 2024-01-01 and 2024-12-31 (Sorted by  ...
 [6] <meta name="google-site-verification" content="0cadf7898134e79b">\n
 [7] <meta name="msvalidate.01" content="C1DACEF2769068C0B0D2687C9E5105FA">\n
 [8] <meta name="robots" content="max-image-preview:large">\n
 [9] <meta property="og:url" content="https://www.imdb.com/search/">\n
[10] <meta property="og:site_name" content="IMDb">\n
[11] <meta property="og:type" content="website">\n
[12] <meta property="og:image" content="https://m.media-amazon.com/images/G/0 ...
[13] <meta property="og:image:height" content="1000">\n
[14] <meta property="og:image:width" content="1000">\n
[15] <meta content="en_US" property="og:locale">\n
[16] <meta content="es_ES" property="og:locale:alternate">\n
[17] <meta content="es_MX" property="og:locale:alternate">\n
[18] <meta content="fr_FR" property="og:locale:alternate">\n
[19] <meta content="fr_CA" property="og:locale:alternate">\n
[20] <meta content="it_IT" property="og:locale:alternate">\n
...
  • Suppose we want to scrape the following features from this page:
    • Rank (popularity)
    • Title
    • Description
    • Runtime
    • Film rating
    • User rating
    • Metascore
    • Votes

3.1 Rank and title

  • Use SelectorGadget to find the CSS selector .ipc-title-link-wrapper .ipc-title__text.
# Use CSS selectors to scrape the rank and title section
ranktitle_data <- webpage |> 
  html_elements('.ipc-title-link-wrapper .ipc-title__text') |>
  html_text() |>
  print()
  [1] "1. The Life of Chuck"                      
  [2] "2. The Substance"                          
  [3] "3. Trap"                                   
  [4] "4. Beetlejuice Beetlejuice"                
  [5] "5. Anora"                                  
  [6] "6. Relay"                                  
  [7] "7. Eden"                                   
  [8] "8. Dune: Part Two"                         
  [9] "9. Bone Lake"                              
 [10] "10. The Brutalist"                         
 [11] "11. Civil War"                             
 [12] "12. The Wild Robot"                        
 [13] "13. Blink Twice"                           
 [14] "14. Wicked"                                
 [15] "15. A Complete Unknown"                    
 [16] "16. Gladiator II"                          
 [17] "17. Nosferatu"                             
 [18] "18. Deadpool & Wolverine"                  
 [19] "19. Heretic"                               
 [20] "20. Nightbitch"                            
 [21] "21. Babygirl"                              
 [22] "22. The Fall Guy"                          
 [23] "23. The Ministry of Ungentlemanly Warfare" 
 [24] "24. Damaged"                               
 [25] "25. Alien: Romulus"                        
 [26] "26. Conclave"                              
 [27] "27. Argylle"                               
 [28] "28. Secret Mall Apartment"                 
 [29] "29. Kinds of Kindness"                     
 [30] "30. Challengers"                           
 [31] "31. Ghostbusters: Frozen Empire"           
 [32] "32. MaXXXine"                              
 [33] "33. We Bury the Dead"                      
 [34] "34. I'm Still Here"                        
 [35] "35. Ordinary Angels"                       
 [36] "36. The Beekeeper"                         
 [37] "37. Friendship"                            
 [38] "38. Longlegs"                              
 [39] "39. It Ends with Us"                       
 [40] "40. Speak No Evil"                         
 [41] "41. Wolfs"                                 
 [42] "42. We Live in Time"                       
 [43] "43. Abigail"                               
 [44] "44. Small Things Like These"               
 [45] "45. Furiosa: A Mad Max Saga"               
 [46] "46. The Apprentice"                        
 [47] "47. The Killer's Game"                     
 [48] "48. The Instigators"                       
 [49] "49. Flow"                                  
 [50] "50. The Idea of You"                       
 [51] "51. Road House"                            
 [52] "52. Shell"                                 
 [53] "53. Twisters"                              
 [54] "54. Kraven the Hunter"                     
 [55] "55. Caddo Lake"                            
 [56] "56. The Order"                             
 [57] "57. Juror #2"                              
 [58] "58. Carry-On"                              
 [59] "59. The Count of Monte-Cristo"             
 [60] "60. Subservience"                          
 [61] "61. Bird"                                  
 [62] "62. The Watchers"                          
 [63] "63. Rebel Ridge"                           
 [64] "64. Tell Me What You Want"                 
 [65] "65. Immaculate"                            
 [66] "66. Freaky Tales"                          
 [67] "67. Venom: The Last Dance"                 
 [68] "68. He's Watching You"                     
 [69] "69. Borderlands"                           
 [70] "70. Inside Out 2"                          
 [71] "71. A Quiet Place: Day One"                
 [72] "72. Moana 2"                               
 [73] "73. Damsel"                                
 [74] "74. Fly Me to the Moon"                    
 [75] "75. Sonic the Hedgehog 3"                  
 [76] "76. Joker: Folie à Deux"                   
 [77] "77. The Unholy Trinity"                    
 [78] "78. A Real Pain"                           
 [79] "79. Elevation"                             
 [80] "80. Smile 2"                               
 [81] "81. IF"                                    
 [82] "82. Madame Web"                            
 [83] "83. The Shrouds"                           
 [84] "84. Bad Boys: Ride or Die"                 
 [85] "85. The Strangers: Chapter 1"              
 [86] "86. Kingdom of the Planet of the Apes"     
 [87] "87. Don't Let's Go to the Dogs Tonight"    
 [88] "88. Despicable Me 4"                       
 [89] "89. Love Lies Bleeding"                    
 [90] "90. Queer"                                 
 [91] "91. Strange Harvest"                       
 [92] "92. The Damned"                            
 [93] "93. Kung Fu Panda 4"                       
 [94] "94. The Luckiest Man in America"           
 [95] "95. Fight or Flight"                       
 [96] "96. The Assessment"                        
 [97] "97. On Swift Horses"                       
 [98] "98. Megalopolis"                           
 [99] "99. I Saw the TV Glow"                     
[100] "100. Horizon: An American Saga - Chapter 1"
# rank
rank_data <- str_extract(ranktitle_data, "^[0-9]+") |> 
  as.integer() |> 
  print()
  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
 [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
 [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
 [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
 [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
 [91]  91  92  93  94  95  96  97  98  99 100
# title
# escape the dot so it matches a literal "." rather than any character
title_data <- str_remove(ranktitle_data, "^[0-9]+\\. ") |> 
  print()
  [1] "The Life of Chuck"                    
  [2] "The Substance"                        
  [3] "Trap"                                 
  [4] "Beetlejuice Beetlejuice"              
  [5] "Anora"                                
  [6] "Relay"                                
  [7] "Eden"                                 
  [8] "Dune: Part Two"                       
  [9] "Bone Lake"                            
 [10] "The Brutalist"                        
 [11] "Civil War"                            
 [12] "The Wild Robot"                       
 [13] "Blink Twice"                          
 [14] "Wicked"                               
 [15] "A Complete Unknown"                   
 [16] "Gladiator II"                         
 [17] "Nosferatu"                            
 [18] "Deadpool & Wolverine"                 
 [19] "Heretic"                              
 [20] "Nightbitch"                           
 [21] "Babygirl"                             
 [22] "The Fall Guy"                         
 [23] "The Ministry of Ungentlemanly Warfare"
 [24] "Damaged"                              
 [25] "Alien: Romulus"                       
 [26] "Conclave"                             
 [27] "Argylle"                              
 [28] "Secret Mall Apartment"                
 [29] "Kinds of Kindness"                    
 [30] "Challengers"                          
 [31] "Ghostbusters: Frozen Empire"          
 [32] "MaXXXine"                             
 [33] "We Bury the Dead"                     
 [34] "I'm Still Here"                       
 [35] "Ordinary Angels"                      
 [36] "The Beekeeper"                        
 [37] "Friendship"                           
 [38] "Longlegs"                             
 [39] "It Ends with Us"                      
 [40] "Speak No Evil"                        
 [41] "Wolfs"                                
 [42] "We Live in Time"                      
 [43] "Abigail"                              
 [44] "Small Things Like These"              
 [45] "Furiosa: A Mad Max Saga"              
 [46] "The Apprentice"                       
 [47] "The Killer's Game"                    
 [48] "The Instigators"                      
 [49] "Flow"                                 
 [50] "The Idea of You"                      
 [51] "Road House"                           
 [52] "Shell"                                
 [53] "Twisters"                             
 [54] "Kraven the Hunter"                    
 [55] "Caddo Lake"                           
 [56] "The Order"                            
 [57] "Juror #2"                             
 [58] "Carry-On"                             
 [59] "The Count of Monte-Cristo"            
 [60] "Subservience"                         
 [61] "Bird"                                 
 [62] "The Watchers"                         
 [63] "Rebel Ridge"                          
 [64] "Tell Me What You Want"                
 [65] "Immaculate"                           
 [66] "Freaky Tales"                         
 [67] "Venom: The Last Dance"                
 [68] "He's Watching You"                    
 [69] "Borderlands"                          
 [70] "Inside Out 2"                         
 [71] "A Quiet Place: Day One"               
 [72] "Moana 2"                              
 [73] "Damsel"                               
 [74] "Fly Me to the Moon"                   
 [75] "Sonic the Hedgehog 3"                 
 [76] "Joker: Folie à Deux"                  
 [77] "The Unholy Trinity"                   
 [78] "A Real Pain"                          
 [79] "Elevation"                            
 [80] "Smile 2"                              
 [81] "IF"                                   
 [82] "Madame Web"                           
 [83] "The Shrouds"                          
 [84] "Bad Boys: Ride or Die"                
 [85] "The Strangers: Chapter 1"             
 [86] "Kingdom of the Planet of the Apes"    
 [87] "Don't Let's Go to the Dogs Tonight"   
 [88] "Despicable Me 4"                      
 [89] "Love Lies Bleeding"                   
 [90] "Queer"                                
 [91] "Strange Harvest"                      
 [92] "The Damned"                           
 [93] "Kung Fu Panda 4"                      
 [94] "The Luckiest Man in America"          
 [95] "Fight or Flight"                      
 [96] "The Assessment"                       
 [97] "On Swift Horses"                      
 [98] "Megalopolis"                          
 [99] "I Saw the TV Glow"                    
[100] "Horizon: An American Saga - Chapter 1"
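
The rank and title can also be separated in a single pass with regex capture groups. The sketch below assumes every entry follows the "rank. title" format seen in the output above; `sample_ranktitle` is a hypothetical three-element vector copied from that output, not a fresh scrape.

```r
library(stringr)

# Hypothetical sample in the same "rank. title" format as ranktitle_data
sample_ranktitle <- c(
  "1. The Life of Chuck",
  "8. Dune: Part Two",
  "100. Horizon: An American Saga - Chapter 1"
)

# str_match() returns a character matrix:
# column 1 is the full match, columns 2 and 3 are the capture groups
parts <- str_match(sample_ranktitle, "^([0-9]+)\\.\\s+(.*)$")

sample_rank  <- as.integer(parts[, 2])
sample_title <- parts[, 3]
```

An entry that does not match the pattern yields `NA` in both columns, which makes malformed rows easy to spot before the vectors are combined into a data frame.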

3.2 Description

description_data <- webpage |>
  html_elements('.ipc-html-content-inner-div') |>
  html_text() |>
  print()
  [1] "A life-affirming, genre-bending story about three chapters in the life of an ordinary man named Charles Krantz."                                                                                                                                  
  [2] "A fading celebrity takes a black-market drug: a cell-replicating substance that helps her create a younger, better version of herself."                                                                                                           
  [3] "A father and his teen daughter attend a pop concert only to realize they've entered the center of a dark and sinister event."                                                                                                                     
  [4] "After a family tragedy, three generations of the Deetz family return home to Winter River. Still haunted by Beetlejuice, Lydia's life is turned upside down when her teenage daughter, Astrid, accidentally opens the portal to the Afterlife."   
  [5] "A young stripper from Brooklyn meets and impulsively marries the son of a Russian oligarch. Once the news reaches Russia, her fairy tale is threatened as his parents set out for New York to get the marriage annulled."                         
  [6] "A broker of lucrative payoffs between corrupt corporations and the individuals who threaten them breaks his own rules when a new client seeks his protection to stay alive."                                                                      
  [7] "Based on a factual account of a group of outsiders who settle on a remote island only to discover their greatest threat isn't the brutal climate or deadly wildlife, but each other."                                                             
  [8] "Paul Atreides unites with the Fremen while on a warpath of revenge against the conspirators who destroyed his family. Facing a choice between the love of his life and the fate of the universe, he endeavors to prevent a terrible future."      
  [9] "A couple's vacation at a secluded estate is upended when they're forced to share the mansion with a mysterious couple. A dream getaway spirals into a nightmarish maze of sex, lies, and manipulation, triggering a battle for survival."         
 [10] "A visionary architect flees post-war Europe in 1947 for a brighter future in the United States and finds his life forever changed by a wealthy client."                                                                                           
 [11] "In a dystopian future, four journalists travel across the United States during a nation-wide conflict. While trying to survive, they aim to reach the White House to interview the president before he is overthrown."                            
 [12] "After a shipwreck, an intelligent robot called Roz is stranded on an uninhabited island. To survive the harsh environment, Roz bonds with the island's animals and cares for an orphaned baby goose."                                             
 [13] "When tech billionaire Slater King meets cocktail waitress Frida at his fundraising gala, he invites her to join him and his friends on a dream vacation on his private island. As strange things start to happen, Frida questions her reality."   
 [14] "Elphaba, a young woman ridiculed for her green skin, and Galinda, a popular girl, become friends at Shiz University in the Land of Oz. After an encounter with the Wonderful Wizard of Oz, their friendship reaches a crossroads."                
 [15] "In 1961, an unknown 19-year-old Bob Dylan arrives in New York City with his guitar and forges relationships with musical icons on his meteoric rise, culminating in a groundbreaking performance that reverberates around the world."             
 [16] "After his home is conquered by the tyrannical emperors who now lead Rome, Lucius is forced to enter the Colosseum and must look to his past to find strength to return the glory of Rome to its people."                                          
 [17] "A gothic tale of obsession between a haunted young woman and the terrifying vampire infatuated with her, causing untold horror in its wake."                                                                                                      
 [18] "Deadpool is offered a place in the Marvel Cinematic Universe by the Time Variance Authority, but instead recruits a variant of Wolverine to save his universe from extinction."                                                                   
 [19] "Two young religious women are drawn into a game of cat-and-mouse in the house of a strange man."                                                                                                                                                  
 [20] "A woman pauses her career to be a stay-at-home mom, but soon her domesticity takes a surreal turn."                                                                                                                                               
 [21] "A high-powered CEO puts her career and family on the line when she begins a torrid affair with her much-younger intern."                                                                                                                          
 [22] "A stuntman, fresh off an almost career-ending accident, has to track down a missing movie star, solve a conspiracy and try to win back the love of his life while still doing his day job."                                                       
 [23] "The British military recruits a small group of highly skilled soldiers to strike against German forces behind enemy lines during World War II."                                                                                                   
 [24] "Chicago detective Dan Lawson travels to Scotland to link up with Scottish detective Glen Boyd following the resurgence of a serial killer whose crimes match an unsolved case he investigated 5 years previously."                                
 [25] "While scavenging the deep ends of a derelict space station, a group of young space colonists come face to face with the most terrifying life form in the universe."                                                                               
 [26] "When Cardinal Lawrence is tasked with leading the selection of a new Pope, he finds himself in a web of conspiracy and intrigue that could shake the very foundations of the Catholic Church."                                                    
 [27] "A reclusive author who writes espionage novels about a secret agent and a global spy syndicate realizes that the plot of the new book she's writing starts to mirror real-world events in real time."                                             
 [28] "In 2003, eight Rhode Islanders created a secret apartment inside a busy mall and lived there for four years, filming everything along the way. Far more than a prank, the secret apartment became a deeply meaningful place for all involved."    
 [29] "A man seeks to break free from his predetermined path, a cop questions his wife's demeanor after her return from a supposed drowning, and a woman searches for an extraordinary individual prophesied to become a renowned spiritual guide."      
 [30] "Tashi, a former tennis prodigy turned coach, transformed her husband into a champion. But to overcome a recent losing streak and redeem himself, he'll need to face off against his former best friend and Tashi's ex-boyfriend."                 
 [31] "When the discovery of an ancient artifact unleashes an evil force, Ghostbusters new and old must join forces to protect their home and save the world from a second ice age."                                                                     
 [32] "In 1980s Hollywood, adult film star and aspiring actress Maxine Minx finally gets her big break. But as a mysterious killer stalks the starlets of Hollywood, a trail of blood threatens to reveal her sinister past."                            
 [33] "After a catastrophic military disaster, the dead don't just rise - they hunt. Ava searches for her missing husband, but what she finds is far more terrifying."                                                                                   
 [34] "A woman married to a former politician during the military dictatorship in Brazil is forced to reinvent herself and chart a new course for her family after a violent and arbitrary act."                                                         
 [35] "Inspired by the incredible true story of a hairdresser who single-handedly rallies an entire community to help a widowed father save the life of his critically ill young daughter."                                                              
 [36] "A former operative of a powerful organization embarks on a brutal campaign for vengeance."                                                                                                                                                        
 [37] "A suburban dad falls hard for his charismatic new neighbor."                                                                                                                                                                                      
 [38] "In pursuit of a serial killer, an FBI agent uncovers a series of occult clues that she must solve to end his terrifying killing spree."                                                                                                           
 [39] "When a woman's first love suddenly reenters her life, her relationship with a charming, but abusive neurosurgeon is upended and she realizes she must learn to rely on her own strength to make an impossible choice for her future."             
 [40] "A family is invited to spend a whole weekend in a lonely home in the countryside, but as the weekend progresses, they realize that a dark side lies within the family who invited them."                                                          
 [41] "Two rival fixers cross paths when they're both called in to help cover up a prominent New York official's misstep. Over one explosive night, they'll have to set aside their petty grievances and their egos to finish the job."                  
 [42] "After an unusual encounter, a talented chef and a recently divorcée fall in love and build the home and family they've always dreamed of, until a painful truth puts their love story to the test."                                               
 [43] "After a group of criminals kidnap the ballerina daughter of a powerful underworld figure, they retreat to an isolated mansion, unaware that they're locked inside with no normal little girl."                                                    
 [44] "In 1985 devoted father Bill Furlong discovers disturbing secrets kept by the local convent and uncovers shocking truths of his own."                                                                                                              
 [45] "After being snatched from the Green Place of Many Mothers, while the tyrants Dementus and Immortan Joe fight for power and control, the young Furiosa must survive many trials as she puts together the means to find her way home."              
 [46] "A young man took over his father's real-estate business in 1970s and '80s New York, and got the helping hand of an infamous closeted gay lawyer who helped him turn this young man into a notorious legend. Based on true events."                
 [47] "When a hitman is diagnosed with a terminal illness, he decides to take a hit out on himself. But when the very hitmen he hired also target his ex-girlfriend, he must fend off an army of assassin colleagues."                                   
 [48] "Follows two robbers who must go on the run with the help of one of their therapists after a theft doesn't go as planned."                                                                                                                         
 [49] "Cat is a solitary animal, but as its home is devastated by a great flood, he finds refuge on a boat populated by various species, and will have to team up with them despite their differences."                                                  
 [50] "Solène, a 40-year-old single mom, begins an unexpected romance with 24-year-old Hayes Campbell, the lead singer of August Moon, the hottest boy band on the planet."                                                                              
 [51] "Ex-UFC fighter Dalton takes a job as a bouncer at a Florida Keys roadhouse, only to discover that this paradise is not all it seems."                                                                                                             
 [52] "Desperate to reclaim her career, once-beloved actress Samantha Lake is drawn into the glamorous world of wellness mogul Zoe Shannon -only to uncover a monstrous truth beneath its flawless surface."                                             
 [53] "Kate Carter, a retired tornado-chaser and meteorologist, is persuaded to return to Oklahoma to work with a new team and new technologies."                                                                                                        
 [54] "Kraven's complex relationship with his ruthless father, Nikolai Kravinoff, starts him down a path of vengeance with brutal consequences, motivating him to become not only the greatest hunter in the world, but also one of its most feared."    
 [55] "When an 8-year-old girl disappears on Caddo Lake, a series of past deaths and disappearances begin to link together, altering a broken family's history."                                                                                         
 [56] "A series of bank robberies and car heists frightens communities in the Pacific Northwest. A lone FBI agent believes that the crimes were not the work of financially motivated criminals but rather a group of dangerous domestic terrorists."    
 [57] "While serving as a juror in a high-profile murder trial, a family man finds himself struggling with a serious moral dilemma, one he could use to sway the jury verdict and potentially convict or free the wrong killer."                         
 [58] "A mysterious traveler blackmails a young TSA agent into letting a dangerous package slip through security and onto a Christmas Eve flight."                                                                                                       
 [59] "After escaping from an island prison where he spent 14 years for being wrongly accused of state treason, Edmond Dantès returns as the count of Monte Cristo to exact revenge on the men who betrayed him."                                        
 [60] "A struggling father purchases a domestic SIM to help care for his family, unaware she will gain self-awareness."                                                                                                                                  
 [61] "Bailey lives with her brother Hunter and her father Bug, who raises them alone in a squat in northern Kent. Bug doesn't have much time to devote to them. Bailey looks for attention and adventure elsewhere."                                    
 [62] "A young artist gets stranded in an extensive, immaculate forest in western Ireland, where, after finding shelter, she becomes trapped alongside three strangers, stalked by mysterious creatures each night."                                     
 [63] "A former Marine grapples his way through a web of small-town corruption when an attempt to post bail for his cousin escalates into a violent standoff with the local police chief."                                                               
 [64] "After his father's death, Eric Zimmerman travels to Spain to oversee his company's branches. In Madrid, he falls for Judith and engage in an intense, erotic relationship full of exploration. Eric fears his secret may end their affair."       
 [65] "Cecilia, a woman of devout faith, is warmly welcomed to the picture-perfect Italian countryside where she is offered a new role at an illustrious convent. But it becomes clear to Cecilia that her new home harbors dark and horrifying secrets."
 [66] "Four interconnected stories set in 1987 Oakland, CA. will tell about the love of music, movies, people, places and memories beyond our knowable universe."                                                                                        
 [67] "Eddie Brock and Venom must make a devastating decision as they're pursued by a mysterious military man and alien monsters from Venom's home world."                                                                                               
 [68] "A teenager investigating mysterious murders in his small town discovers a collection of VHS tapes that could reveal the identity of a notorious serial killer."                                                                                   
 [69] "An infamous bounty hunter returns to her childhood home, the chaotic planet Pandora, and forms an unlikely alliance with a team of misfits to find the missing daughter of the most powerful man in the universe."                                
 [70] "A sequel that features Riley entering puberty and experiencing brand new, more complex emotions as a result. As Riley tries to adapt to her teenage years, her old emotions try to adapt to the possibility of being replaced."                   
 [71] "A young woman named Sam finds herself trapped in New York City during the early stages of an invasion by alien creatures with ultra-sensitive hearing."                                                                                           
 [72] "After receiving an unexpected call from her wayfinding ancestors, Moana must journey to the far seas of Oceania and into dangerous, long-lost waters for an adventure unlike anything she's ever faced."                                          
 [73] "A dutiful damsel agrees to marry a handsome prince, only to find the royal family has recruited her as a sacrifice to repay an ancient debt."                                                                                                     
 [74] "Marketing maven Kelly Jones wreaks havoc on NASA launch director Cole Davis's already difficult task. When the White House deems the mission too important to fail, the countdown truly begins."                                                  
 [75] "Sonic, Knuckles, and Tails reunite against a powerful new adversary, Shadow, a mysterious villain with powers unlike anything they have faced before. With their abilities outmatched, Team Sonic must seek out an unlikely alliance."            
 [76] "Struggling with his dual identity, failed comedian Arthur Fleck meets the love of his life, Harley Quinn, while incarcerated at Arkham State Hospital."                                                                                           
 [77] "Buried secrets of an 1870s Montana town spark violence when a young man returns to reclaim his legacy and is caught between a sheriff determined to maintain order and a mysterious stranger hell-bent on destroying it."                         
 [78] "Mismatched cousins reunite for a tour through Poland to honor their beloved grandmother, but their old tensions resurface against the backdrop of their family history."                                                                          
 [79] "A single father and two women venture from the safety of their homes to face monstrous creatures to save the life of a young boy."                                                                                                                
 [80] "About to embark on a world tour, global pop sensation Skye Riley begins experiencing increasingly terrifying and inexplicable events. Overwhelmed by the escalating horrors and the pressures of fame, Skye is forced to face her past."          
 [81] "A young girl who goes through a difficult experience begins to see everyone's imaginary friends who have been left behind as their real-life friends have grown up."                                                                              
 [82] "Forced to confront her past, Cassandra Webb, a Manhattan paramedic that may have clairvoyant abilities, forms a relationship with three young women destined for powerful futures, if they can survive their threatening present."                
 [83] "Karsh, an innovative businessman and grieving widower, builds a device to connect with the dead inside a burial shroud."                                                                                                                          
 [84] "When their late police captain gets linked to drug cartels, wisecracking Miami cops Mike Lowrey and Marcus Burnett embark on a dangerous mission to clear his name."                                                                              
 [85] "After their car breaks down in an eerie small town, a young couple is forced to spend the night in a remote cabin. Panic ensues as they are terrorized by three masked strangers who strike with no mercy and seemingly no motive."               
 [86] "Many years after the reign of Caesar, a young ape goes on a journey that will lead him to question everything he's been taught about the past and make choices that will define a future for apes and humans alike."                              
 [87] "Depicts 8-year-old Bobo's life on her family's Rhodesian farm during the Bush War's final stages. It explores the family's bond with Africa's land and the war's impact on the region and individuals through Bobo's perspective."                
 [88] "Gru, Lucy, Margo, Edith, and Agnes welcome a new member to the family, Gru Jr., who is intent on tormenting his dad. Gru faces a new nemesis in Maxime Le Mal and his girlfriend Valentina, and the family is forced to go on the run."           
 [89] "Reclusive gym manager Lou falls hard for Jackie, an ambitious bodybuilder headed through town to Vegas in pursuit of her dream. But their love ignites violence, pulling them deep into the web of Lou's criminal family."                        
 [90] "In 1950s Mexico City, an American immigrant in his late forties leads a solitary life amidst a small American community. However, the arrival of a young student stirs the man into finally establishing a meaningful connection with someone."   
 [91] "Detectives are thrust into a chilling hunt for \"Mr. Shiny\"-a sadistic serial killer from the past whose return marks the beginning of a new wave of grotesque, otherworldly crimes tied to a dark cosmic force."                                
 [92] "A 19th-century widow has to make an impossible choice when, during an especially cruel winter, a foreign ship sinks off the coast of her Icelandic fishing village."                                                                              
 [93] "After Po is tapped to become the Spiritual Leader of the Valley of Peace, he needs to find and train a new Dragon Warrior, while a wicked sorceress plans to re-summon all the master villains whom Po has vanquished to the spirit realm."       
 [94] "May 1984. An unemployed ice cream truck driver steps onto the game show Press Your Luck harboring a secret: the key to endless money. But his winning streak is threatened when the bewildered executives uncover his real motivations."          
 [95] "A mercenary takes on the job of tracking down a target on a plane but must protect that target when they're surrounded by people trying to kill both of them."                                                                                    
 [96] "In the near future where parenthood is strictly controlled, a couple's seven-day assessment for the right to have a child unravels into a psychological nightmare."                                                                               
 [97] "Muriel and her husband Lee are about to begin a bright new life, which is upended by the arrival of Lee's brother. Muriel embarks on a secret life, gambling on racehorses and discovering a love she never thought possible."                    
 [98] "The city of New Rome faces the duel between Cesar Catilina, a brilliant artist in favor of a Utopian future, and the greedy mayor Franklyn Cicero. Between them is Julia Cicero, with her loyalty divided between her father and her beloved."    
 [99] "A teenager just trying to make it through life in the suburbs is introduced by a classmate to a mysterious late-night TV show."                                                                                                                   
[100] "Chronicles a multi-faceted, 15-year span of pre-and post-Civil War expansion and settlement of the American west."                                                                                                                                

3.3 Runtime

  • Retrieve runtime strings
# Use CSS selectors to scrape the movie runtime
runtime_text <- webpage |>
  html_elements('.dli-title-metadata-item:nth-child(2)') |>
  html_text() |>
  print()
  [1] "1h 51m" "2h 21m" "1h 45m" "1h 45m" "2h 19m" "1h 52m" "2h 9m"  "2h 46m"
  [9] "1h 34m" "3h 36m" "1h 49m" "1h 42m" "1h 42m" "2h 40m" "2h 21m" "2h 28m"
 [17] "2h 12m" "2h 8m"  "1h 51m" "1h 39m" "1h 54m" "2h 6m"  "2h 2m"  "1h 37m"
 [25] "1h 59m" "2h"     "2h 19m" "1h 32m" "2h 44m" "2h 11m" "1h 55m" "1h 43m"
 [33] "1h 35m" "2h 17m" "1h 58m" "1h 45m" "1h 40m" "1h 41m" "2h 10m" "1h 50m"
 [41] "1h 48m" "1h 48m" "1h 49m" "1h 38m" "2h 28m" "2h 2m"  "1h 44m" "1h 41m"
 [49] "1h 25m" "1h 55m" "2h 1m"  "1h 40m" "2h 2m"  "2h 7m"  "1h 43m" "1h 56m"
 [57] "1h 54m" "1h 59m" "2h 58m" "1h 46m" "1h 59m" "1h 42m" "2h 11m" "1h 54m"
 [65] "1h 29m" "1h 47m" "1h 50m" "1h 42m" "1h 41m" "1h 36m" "1h 39m" "1h 40m"
 [73] "1h 50m" "2h 12m" "1h 50m" "2h 18m" "1h 35m" "1h 30m" "1h 31m" "2h 7m" 
 [81] "1h 44m" "1h 56m" "2h"     "1h 55m" "1h 31m" "2h 25m" "1h 39m" "1h 34m"
 [89] "1h 44m" "2h 17m" "1h 34m" "1h 29m" "1h 34m" "1h 31m" "1h 42m" "1h 54m"
 [97] "1h 59m" "2h 18m" "1h 40m" "3h 1m" 
  • Hours and minutes:
# hours
runtime_hour <- runtime_text |>
  str_extract("\\d+(?=h)") |>
  as.integer() |>
  print()
  [1] 1 2 1 1 2 1 2 2 1 3 1 1 1 2 2 2 2 2 1 1 1 2 2 1 1 2 2 1 2 2 1 1 1 2 1 1 1
 [38] 1 2 1 1 1 1 1 2 2 1 1 1 1 2 1 2 2 1 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 2
 [75] 1 2 1 1 1 2 1 1 2 1 1 2 1 1 1 2 1 1 1 1 1 1 1 2 1 3
# minutes
runtime_min <- runtime_text |>
  str_extract("\\d+(?=m)") |>
  # replace NA by 0
  str_replace_na("0") |>
  as.integer() |>
  print()
  [1] 51 21 45 45 19 52  9 46 34 36 49 42 42 40 21 28 12  8 51 39 54  6  2 37 59
 [26]  0 19 32 44 11 55 43 35 17 58 45 40 41 10 50 48 48 49 38 28  2 44 41 25 55
 [51]  1 40  2  7 43 56 54 59 58 46 59 42 11 54 29 47 50 42 41 36 39 40 50 12 50
 [76] 18 35 30 31  7 44 56  0 55 31 25 39 34 44 17 34 29 34 31 42 54 59 18 40  1
  • Runtime in minutes
runtime_data <- (runtime_hour * 60 + runtime_min) |> print()
  [1] 111 141 105 105 139 112 129 166  94 216 109 102 102 160 141 148 132 128
 [19] 111  99 114 126 122  97 119 120 139  92 164 131 115 103  95 137 118 105
 [37] 100 101 130 110 108 108 109  98 148 122 104 101  85 115 121 100 122 127
 [55] 103 116 114 119 178 106 119 102 131 114  89 107 110 102 101  96  99 100
 [73] 110 132 110 138  95  90  91 127 104 116 120 115  91 145  99  94 104 137
 [91]  94  89  94  91 102 114 119 138 100 181
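
The extraction above fills in missing minutes but assumes every runtime has an hour part; a string such as "45m" would yield NA hours. A minimal sketch of a helper that tolerates either part being absent, assuming the same "1h 51m" format (`parse_runtime` is a hypothetical name, not part of the lecture code):

```r
library(stringr)

# Hypothetical helper: convert strings like "1h 51m", "2h", "45m" to minutes
parse_runtime <- function(x) {
  hours <- as.integer(str_extract(x, "\\d+(?=h)"))
  mins  <- as.integer(str_extract(x, "\\d+(?=m)"))
  # strings with neither an hour nor a minute part stay NA
  unparsed <- is.na(hours) & is.na(mins)
  total <- replace(hours, is.na(hours), 0L) * 60L +
    replace(mins, is.na(mins), 0L)
  total[unparsed] <- NA_integer_
  total
}

parse_runtime(c("1h 51m", "2h", "45m", "n/a"))
```

Keeping unparsable strings as NA, rather than silently turning them into 0, makes a shifted or mis-selected metadata field easier to notice.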

3.4 Film rating

  • Film rating:
filmrating_data <- webpage |>
  html_elements('.dli-title-metadata-item:nth-child(3)') |>
  html_text() |>
  str_replace("Unrated", "Not Rated") |>
  print()
 [1] "R"         "R"         "PG-13"     "PG-13"     "R"         "R"        
 [7] "R"         "PG-13"     "R"         "R"         "R"         "PG"       
[13] "R"         "PG"        "R"         "R"         "R"         "R"        
[19] "R"         "R"         "R"         "PG-13"     "R"         "R"        
[25] "R"         "PG"        "PG-13"     "TV-MA"     "R"         "R"        
[31] "PG-13"     "R"         "R"         "PG-13"     "PG"        "R"        
[37] "R"         "R"         "PG-13"     "R"         "R"         "R"        
[43] "R"         "PG-13"     "R"         "R"         "R"         "R"        
[49] "PG"        "R"         "R"         "R"         "PG-13"     "R"        
[55] "PG-13"     "R"         "PG-13"     "PG-13"     "Not Rated" "R"        
[61] "R"         "PG-13"     "TV-MA"     "R"         "R"         "PG-13"    
[67] "Not Rated" "PG-13"     "PG"        "PG-13"     "PG"        "PG-13"    
[73] "PG-13"     "PG"        "R"         "R"         "R"         "R"        
[79] "R"         "PG"        "PG-13"     "R"         "R"         "R"        
[85] "PG-13"     "R"         "PG"        "R"         "R"         "R"        
[91] "R"         "PG"        "R"         "R"         "R"         "R"        
[97] "R"         "PG-13"     "R"        

Movie 64 has no film rating, so the scraped vector has only 99 entries. We insert “Not Rated” at position 64.

filmrating_data <- append(filmrating_data, "Not Rated", after = 63)
filmrating_data
  [1] "R"         "R"         "PG-13"     "PG-13"     "R"         "R"        
  [7] "R"         "PG-13"     "R"         "R"         "R"         "PG"       
 [13] "R"         "PG"        "R"         "R"         "R"         "R"        
 [19] "R"         "R"         "R"         "PG-13"     "R"         "R"        
 [25] "R"         "PG"        "PG-13"     "TV-MA"     "R"         "R"        
 [31] "PG-13"     "R"         "R"         "PG-13"     "PG"        "R"        
 [37] "R"         "R"         "PG-13"     "R"         "R"         "R"        
 [43] "R"         "PG-13"     "R"         "R"         "R"         "R"        
 [49] "PG"        "R"         "R"         "R"         "PG-13"     "R"        
 [55] "PG-13"     "R"         "PG-13"     "PG-13"     "Not Rated" "R"        
 [61] "R"         "PG-13"     "TV-MA"     "Not Rated" "R"         "R"        
 [67] "PG-13"     "Not Rated" "PG-13"     "PG"        "PG-13"     "PG"       
 [73] "PG-13"     "PG-13"     "PG"        "R"         "R"         "R"        
 [79] "R"         "R"         "PG"        "PG-13"     "R"         "R"        
 [85] "R"         "PG-13"     "R"         "PG"        "R"         "R"        
 [91] "R"         "R"         "PG"        "R"         "R"         "R"        
 [97] "R"         "R"         "PG-13"     "R"        
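
Hard-coding position 64 breaks as soon as the list changes. One common rvest pattern is to select one node per movie first and then call `html_element()` (singular) inside each node, which returns a missing value where the child is absent. The sketch below runs on a toy page; the class names `movie`, `title`, and `rating` are made up for illustration, and the corresponding selectors on the real page would need to be found by inspecting it.

```r
library(rvest)

# Toy page: the second movie has no rating element
page <- minimal_html('
  <ul>
    <li class="movie"><span class="title">1. A</span><span class="rating">R</span></li>
    <li class="movie"><span class="title">2. B</span></li>
    <li class="movie"><span class="title">3. C</span><span class="rating">PG</span></li>
  </ul>')

# one node per movie
items <- html_elements(page, "li.movie")

# html_element() returns exactly one result per item, NA if absent
rating <- items |>
  html_element(".rating") |>
  html_text2()
rating[is.na(rating)] <- "Not Rated"
rating
```

Because the output has one entry per movie, it stays aligned with the titles without any manual insertion.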

3.5 Votes

  • Vote data
votes_data <- webpage |>
  html_elements('.ipc-rating-star--voteCount') |>
  html_text() |>
  str_extract("\\d+(,\\d+)*") |>
  as.numeric() |>
  print()
  [1]  57 400 159 170 262  29  39 713  10 132 263 213 125 224 113 288 258 553
 [19] 216  24  78 250 153   8 280 250  98   2  69 174 104  84   2 144  19 176
 [37]  37 217  97 128 100  70 113  35 314  71  20  42 122  82 181   3 191  78
 [55]  53  60 122 190  46  35  12  67 100   2  59  16 150 247  56 243 171 122
 [73] 117  57  77 179   5 138  29 135  65 112   8 111  28 166   1  76  63  29
 [91]   4   8  79   7  25  22   5  41  43  44
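
The small magnitudes above suggest that the raw text carries abbreviated counts, so the pattern keeps only the leading digits. Assuming strings such as " (57K)" or " (1.2M)" (an assumption; inspect the `html_text()` output to confirm), a sketch of a parser that restores the magnitude. `parse_votes` is a hypothetical helper.

```r
library(stringr)

# Hypothetical helper: " (57K)" -> 57000, " (1.2M)" -> 1200000, " (845)" -> 845
parse_votes <- function(x) {
  # numeric part, allowing decimal points and thousands separators
  num <- as.numeric(str_remove_all(str_extract(x, "\\d[\\d,.]*"), ","))
  # optional magnitude suffix directly after the number
  suffix <- str_extract(x, "(?<=\\d)[KM]")
  mult <- c(K = 1e3, M = 1e6)[suffix]
  mult[is.na(mult)] <- 1
  unname(num * mult)
}

parse_votes(c(" (57K)", " (1.2M)", " (713K)", " (845)"))
```

Removing the commas before `as.numeric()` also matters on its own: `as.numeric("1,234")` is NA.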

3.6 User rating

  • User rating:
userrating_data <- webpage |>
  html_elements('.ipc-rating-star--rating') |>
  html_text() |>
  as.numeric() |>
  print()
  [1] 7.3 7.2 5.8 6.6 7.4 7.0 6.5 8.4 5.6 7.2 7.0 8.2 6.5 7.3 7.3 6.5 7.1 7.5
 [19] 7.0 5.5 5.8 6.8 6.8 4.7 7.1 7.4 5.6 7.0 6.4 7.0 6.1 6.2 6.0 8.1 7.4 6.3
 [37] 6.6 6.6 6.3 6.8 6.5 7.0 6.5 6.7 7.5 7.1 5.8 6.2 7.9 6.3 6.2 4.9 6.5 5.5
 [55] 6.8 6.8 7.0 6.5 7.6 5.4 7.0 5.7 6.8 4.0 5.8 6.3 6.0 4.7 4.7 7.5 6.3 6.3
 [73] 6.1 6.6 6.9 5.2 5.5 7.0 5.6 6.7 6.4 4.1 5.7 6.5 4.7 6.8 6.9 6.2 6.6 6.4
 [91] 6.3 5.7 6.3 6.2 6.4 6.6 6.0 4.7 5.8 6.6

3.7 Metascore

  • We run into missing data when scraping the metascore.

  • The selector returns only 96 metascores, so 4 movies have none. We could look up the missing ones by hand, but that is tedious and not reproducible.

# Use CSS selectors to scrape the metascore
ms_data <- html_elements(webpage, '.metacritic-score-box') |>
  html_text() |>
  as.integer() |>
  print()
 [1] 67 78 52 62 91 70 58 79 63 90 75 85 66 73 70 64 78 56 71 56 79 73 55 64 79
[26] 35 82 64 82 46 64 61 85 57 53 72 77 53 66 60 59 62 82 79 64 36 48 87 67 57
[51] 55 65 35 55 75 72 69 75 74 46 76 57 58 41 26 73 68 58 46 53 56 45 48 85 50
[76] 67 46 26 72 54 43 66 74 52 77 72 67 64 54 68 59 62 63 55 86 49
length(ms_data)
[1] 96
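  • First, scrape the titles (never missing) and the metascores (when present) together, in page order. A rank that is immediately followed by another rank marks a movie without a metascore.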
rank_and_metascore <- webpage |>
  html_elements('.ipc-title-link-wrapper .ipc-title__text , .metacritic-score-box') |>
  html_text() |>
  # keep only the text before the first space (the rank "1." or the score)
  str_remove(" .*") |>
  print()
  [1] "1."   "67"   "2."   "78"   "3."   "52"   "4."   "62"   "5."   "91"  
 [11] "6."   "70"   "7."   "58"   "8."   "79"   "9."   "63"   "10."  "90"  
 [21] "11."  "75"   "12."  "85"   "13."  "66"   "14."  "73"   "15."  "70"  
 [31] "16."  "64"   "17."  "78"   "18."  "56"   "19."  "71"   "20."  "56"  
 [41] "21."  "79"   "22."  "73"   "23."  "55"   "24."  "25."  "64"   "26." 
 [51] "79"   "27."  "35"   "28."  "82"   "29."  "64"   "30."  "82"   "31." 
 [61] "46"   "32."  "64"   "33."  "61"   "34."  "85"   "35."  "57"   "36." 
 [71] "53"   "37."  "72"   "38."  "77"   "39."  "53"   "40."  "66"   "41." 
 [81] "60"   "42."  "59"   "43."  "62"   "44."  "82"   "45."  "79"   "46." 
 [91] "64"   "47."  "36"   "48."  "48"   "49."  "87"   "50."  "67"   "51." 
[101] "57"   "52."  "55"   "53."  "65"   "54."  "35"   "55."  "55"   "56." 
[111] "75"   "57."  "72"   "58."  "69"   "59."  "75"   "60."  "61."  "74"  
[121] "62."  "46"   "63."  "76"   "64."  "65."  "57"   "66."  "58"   "67." 
[131] "41"   "68."  "69."  "26"   "70."  "73"   "71."  "68"   "72."  "58"  
[141] "73."  "46"   "74."  "53"   "75."  "56"   "76."  "45"   "77."  "48"  
[151] "78."  "85"   "79."  "50"   "80."  "67"   "81."  "46"   "82."  "26"  
[161] "83."  "72"   "84."  "54"   "85."  "43"   "86."  "66"   "87."  "74"  
[171] "88."  "52"   "89."  "77"   "90."  "72"   "91."  "67"   "92."  "64"  
[181] "93."  "54"   "94."  "68"   "95."  "59"   "96."  "62"   "97."  "63"  
[191] "98."  "55"   "99."  "86"   "100." "49"  
# logical vector indicating if the element is a rank
isrank <- str_detect(rank_and_metascore, "\\.$")
# a rank followed by another rank is a missing metascore
ismissing <- isrank[1:(length(rank_and_metascore) - 1)] & isrank[2:(length(rank_and_metascore))]
# last entry is missing or not
ismissing[length(ismissing) + 1] <- isrank[length(isrank)]
# which ranks are missing metascore
missingpos <- as.integer(rank_and_metascore[ismissing])
metascore_data <- rep(NA, 100)
metascore_data[-missingpos] <- ms_data
metascore_data
  [1] 67 78 52 62 91 70 58 79 63 90 75 85 66 73 70 64 78 56 71 56 79 73 55 NA 64
 [26] 79 35 82 64 82 46 64 61 85 57 53 72 77 53 66 60 59 62 82 79 64 36 48 87 67
 [51] 57 55 65 35 55 75 72 69 75 NA 74 46 76 NA 57 58 41 NA 26 73 68 58 46 53 56
 [76] 45 48 85 50 67 46 26 72 54 43 66 74 52 77 72 67 64 54 68 59 62 63 55 86 49
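
To make the alignment logic concrete, here is a small base-R sketch on a toy interleaved vector (the values are made up). It uses a logical mask instead of negative indexing, because `x[-integer(0)]` selects nothing and the assignment would silently do nothing if no metascore were missing.

```r
# Toy interleaved vector: ranks end with ".", metascores do not
x <- c("1.", "67", "2.", "3.", "52", "4.")
scores <- c(67L, 52L)  # the metascores as scraped on their own

isrank <- grepl("\\.$", x)
# a rank lacks a score if the next element is also a rank,
# or if it is the last element
next_isrank <- c(isrank[-1], TRUE)
missingpos <- as.integer(sub("\\.$", "", x[isrank & next_isrank]))

metascore <- rep(NA_integer_, sum(isrank))
# fill every position that is not flagged as missing
metascore[!(seq_along(metascore) %in% missingpos)] <- scores
metascore
```

Here ranks 2 and 4 have no score, so the result has NA in positions 2 and 4.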

3.8 Visualizing movie data

  • Form a tibble:
# Combine the scraped vectors into a tibble (all must have length 100)
movies <- tibble(
  poprank = rank_data, 
  title = title_data,
  description = description_data, 
  runtime = runtime_data,
  filmrating = filmrating_data,
  userrating = userrating_data,
  metascore = metascore_data, 
  votes = votes_data,
) |>
  print(width=Inf)
# A tibble: 100 × 8
   poprank title                  
     <int> <chr>                  
 1       1 The Life of Chuck      
 2       2 The Substance          
 3       3 Trap                   
 4       4 Beetlejuice Beetlejuice
 5       5 Anora                  
 6       6 Relay                  
 7       7 Eden                   
 8       8 Dune: Part Two         
 9       9 Bone Lake              
10      10 The Brutalist          
   description                                                                  
   <chr>                                                                        
 1 A life-affirming, genre-bending story about three chapters in the life of an…
 2 A fading celebrity takes a black-market drug: a cell-replicating substance t…
 3 A father and his teen daughter attend a pop concert only to realize they've …
 4 After a family tragedy, three generations of the Deetz family return home to…
 5 A young stripper from Brooklyn meets and impulsively marries the son of a Ru…
 6 A broker of lucrative payoffs between corrupt corporations and the individua…
 7 Based on a factual account of a group of outsiders who settle on a remote is…
 8 Paul Atreides unites with the Fremen while on a warpath of revenge against t…
 9 A couple's vacation at a secluded estate is upended when they're forced to s…
10 A visionary architect flees post-war Europe in 1947 for a brighter future in…
   runtime filmrating userrating metascore votes
     <dbl> <chr>           <dbl>     <int> <dbl>
 1     111 R                 7.3        67    57
 2     141 R                 7.2        78   400
 3     105 PG-13             5.8        52   159
 4     105 PG-13             6.6        62   170
 5     139 R                 7.4        91   262
 6     112 R                 7          70    29
 7     129 R                 6.5        58    39
 8     166 PG-13             8.4        79   713
 9      94 R                 5.6        63    10
10     216 R                 7.2        90   132
# ℹ 90 more rows
  • Top 5 popular movies:
movies |>
  slice_min(order_by = poprank, n = 5) |>
  print(width = Inf)
# A tibble: 5 × 8
  poprank title                  
    <int> <chr>                  
1       1 The Life of Chuck      
2       2 The Substance          
3       3 Trap                   
4       4 Beetlejuice Beetlejuice
5       5 Anora                  
  description                                                                   
  <chr>                                                                         
1 A life-affirming, genre-bending story about three chapters in the life of an …
2 A fading celebrity takes a black-market drug: a cell-replicating substance th…
3 A father and his teen daughter attend a pop concert only to realize they've e…
4 After a family tragedy, three generations of the Deetz family return home to …
5 A young stripper from Brooklyn meets and impulsively marries the son of a Rus…
  runtime filmrating userrating metascore votes
    <dbl> <chr>           <dbl>     <int> <dbl>
1     111 R                 7.3        67    57
2     141 R                 7.2        78   400
3     105 PG-13             5.8        52   159
4     105 PG-13             6.6        62   170
5     139 R                 7.4        91   262
  • Top 5 movies by user rating:
movies |>
  slice_max(order_by = userrating, n = 5) |>
  print(width = Inf)
# A tibble: 5 × 8
  poprank title                    
    <int> <chr>                    
1       8 Dune: Part Two           
2      12 The Wild Robot           
3      34 I'm Still Here           
4      49 Flow                     
5      59 The Count of Monte-Cristo
  description                                                                   
  <chr>                                                                         
1 Paul Atreides unites with the Fremen while on a warpath of revenge against th…
2 After a shipwreck, an intelligent robot called Roz is stranded on an uninhabi…
3 A woman married to a former politician during the military dictatorship in Br…
4 Cat is a solitary animal, but as its home is devastated by a great flood, he …
5 After escaping from an island prison where he spent 14 years for being wrongl…
  runtime filmrating userrating metascore votes
    <dbl> <chr>           <dbl>     <int> <dbl>
1     166 PG-13             8.4        79   713
2     102 PG                8.2        85   213
3     137 PG-13             8.1        85   144
4      85 PG                7.9        87   122
5     178 Not Rated         7.6        75    46
  • Top 5 metascores (slice_max() keeps ties by default, so 7 rows are returned):
movies |>
  slice_max(order_by = metascore, n = 5) |>
  print(width = Inf)
# A tibble: 7 × 8
  poprank title            
    <int> <chr>            
1       5 Anora            
2      10 The Brutalist    
3      49 Flow             
4      99 I Saw the TV Glow
5      12 The Wild Robot   
6      34 I'm Still Here   
7      78 A Real Pain      
  description                                                                   
  <chr>                                                                         
1 A young stripper from Brooklyn meets and impulsively marries the son of a Rus…
2 A visionary architect flees post-war Europe in 1947 for a brighter future in …
3 Cat is a solitary animal, but as its home is devastated by a great flood, he …
4 A teenager just trying to make it through life in the suburbs is introduced b…
5 After a shipwreck, an intelligent robot called Roz is stranded on an uninhabi…
6 A woman married to a former politician during the military dictatorship in Br…
7 Mismatched cousins reunite for a tour through Poland to honor their beloved g…
  runtime filmrating userrating metascore votes
    <dbl> <chr>           <dbl>     <int> <dbl>
1     139 R                 7.4        91   262
2     216 R                 7.2        90   132
3      85 PG                7.9        87   122
4     100 PG-13             5.8        86    43
5     102 PG                8.2        85   213
6     137 PG-13             8.1        85   144
7      90 R                 7          85   138
  • How many top 100 movies are in each film rating category?
movies |>
  count(filmrating)
# A tibble: 5 × 2
  filmrating     n
  <chr>      <int>
1 Not Rated      3
2 PG            11
3 PG-13         22
4 R             62
5 TV-MA          2
# bar plot
ggplot(data = movies) +
  geom_bar(mapping = aes(x = fct_infreq(filmrating))) + 
  labs(x = "Film rating", y = "Count")

  • Is there a relationship between user rating and metascore (critics' rating)? How can we also convey the number of votes? Should we stratify by film rating?
ggplot(data = movies, mapping = aes(x = userrating, y = metascore)) +
  geom_point(mapping = aes(size = votes, color = filmrating)) + 
  geom_smooth() +
  labs(y = "Metascore", x = "User rating")
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: Removed 4 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_point()`).
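
A numeric summary can complement the plot. The sketch below uses the first ten movies from the tibble plus movie 24, whose metascore is missing. `cor()` needs `use = "complete.obs"` to drop the incomplete pair, much as ggplot2 dropped the 4 incomplete rows.

```r
# Values copied from the scraped output (movies 1-10 and movie 24)
userrating <- c(7.3, 7.2, 5.8, 6.6, 7.4, 7.0, 6.5, 8.4, 5.6, 7.2, 4.7)
metascore  <- c(67L, 78L, 52L, 62L, 91L, 70L, 58L, 79L, 63L, 90L, NA)

# Pearson measures linear association; Spearman is rank-based
r_pearson  <- cor(userrating, metascore, use = "complete.obs")
r_spearman <- cor(userrating, metascore, use = "complete.obs",
                  method = "spearman")
c(pearson = r_pearson, spearman = r_spearman)
```

On the full data the same call would be `with(movies, cor(userrating, metascore, use = "complete.obs"))`.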

4 Example: Scraping finance data

  • The quantmod package contains many utility functions for retrieving and plotting financial data. For example:
library(quantmod)
stock <- getSymbols(
  "TSLA", 
  src = "yahoo", 
  auto.assign = FALSE, 
  from = "2020-01-01"
  )
head(stock)
           TSLA.Open TSLA.High TSLA.Low TSLA.Close TSLA.Volume TSLA.Adjusted
2020-01-02  28.30000  28.71333 28.11400   28.68400   142981500      28.68400
2020-01-03  29.36667  30.26667 29.12800   29.53400   266677500      29.53400
2020-01-06  29.36467  30.10400 29.33333   30.10267   151995000      30.10267
2020-01-07  30.76000  31.44200 30.22400   31.27067   268231500      31.27067
2020-01-08  31.58000  33.23267 31.21533   32.80933   467164500      32.80933
2020-01-09  33.14000  33.25333 31.52467   32.08933   426606000      32.08933
chartSeries(stock, theme = chartTheme("white"),
            type = "line", log.scale = FALSE, TA = NULL)
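
Prices are often converted to returns before analysis. A minimal sketch, assuming the six closing prices printed by `head(stock)`, that rebuilds them as an xts series and computes daily log returns, so it runs without a network connection:

```r
library(xts)

# Closing prices copied from the head(stock) output
tsla_close <- xts(
  c(28.68400, 29.53400, 30.10267, 31.27067, 32.80933, 32.08933),
  order.by = as.Date(c("2020-01-02", "2020-01-03", "2020-01-06",
                       "2020-01-07", "2020-01-08", "2020-01-09"))
)

# daily log return: log(P_t) - log(P_{t-1});
# the first day has no predecessor, so its return is NA
logret <- diff(log(tsla_close))
logret
```

With the downloaded object, the same computation would be `diff(log(Cl(stock)))`, where `Cl()` is quantmod's accessor for the closing-price column.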

5 Example: Pull tweets into R (not working anymore!)

library(twitteR) #load package
  • Step 1: Apply for a Twitter developer account. Approval takes some time.

  • Step 2: Generate and copy the Twitter App Keys.

consumer_key <- 'XXXXXXXXXX'
consumer_secret <- 'XXXXXXXXXX'
access_token <- 'XXXXXXXXXX'
access_secret <- 'XXXXXXXXXX'
  • Step 3: Set up authentication.
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
  • Step 4: Pull tweets
virus <- searchTwitter('#China + #Coronavirus', 
                       n = 1000, 
                       since = '2020-01-01', 
                       retryOnRateLimit = 1e3)
virus_df <- as_tibble(twListToDF(virus))
virus_df %>% print(width = Inf)